Abish Malik, Purdue University [Primary Contact], amalik@purdue.edu
Shehzad Afzal, Purdue University [Primary Contact], safzal@purdue.edu
Erin Hodgess, University of Houston – Downtown [Faculty Advisor], HodgessE@uhd.edu
David S. Ebert, Purdue University [Faculty Advisor], ebertd@purdue.edu
Ross Maciejewski, Purdue University [Faculty Advisor], rmacieje@purdue.edu
Our work extends the Purdue University Visual Analytics Center’s earlier work on healthcare analysis. The system is designed to facilitate the visual analytics process on categorical spatiotemporal data. As a preprocessing step, emergency department chief complaints are categorized with the University of Pittsburgh’s CoCo classifier [1]. The tool we developed uses linked geographic and temporal views for exploring disease spread. Underlying the views, we apply analytical algorithms based on typical control charting methods for time-series anomaly detection over the categorical chief complaints. These algorithms feed back into the visualization as glyphs within the time series, denoting when anomalous temporal shifts have occurred in the data. These shifts represent deviations from the expected values and indicate a need to investigate the current status of the data. System components include a map view for spatial data visualization, line graph views for visualizing temporal health signals of categorized chief complaints and death records, a stacked graph view for analyzing record linkages between patient visits and deaths, and a statistical summary window providing details on the illnesses by age, gender and chief complaint. All views are linked to an interactive time slider for animation, exploration and analysis.
[1] Chapman, Wendy W., Dowling, John N. and Wagner, Michael M. (2005), "Classification of emergency department chief complaints into 7 syndromes: a retrospective analysis of 527,228 patients", Ann Emerg Med, 46(5): 445-455. Downloaded from http://prdownloads.sourceforge.net/openrods/CoCo_batch_3.zip?download
Video:
VACCINATED.avi
ANSWERS:
MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.
Given a set of hospital admittance records, we first categorize the data into syndromes (Botulinic, Constitutional, Gastrointestinal, Hemorrhagic, Neurological, Rash, Respiratory, and Other) using the CoCo classifier [1]. The categorized data was then ingested into our system, and time series plots of the categories were analyzed using an Exponentially Weighted Moving Average (EWMA) control chart with a 99% confidence interval upper bound. We visualized each syndromic category, by country, as a line graph in our system. EWMA is automatically applied to any temporal plot generated by our system, and alerts are visualized as red circles on the time series plots. Visually exploring the hospital admittance records for each category, by country, we find that two categories (gastrointestinal, shown in Figure 1, and hemorrhagic) have an unusually large number of temporally contiguous alerts in eight of the eleven given spatial locations (Aleppo, Colombia, Iran, Lebanon, Nairobi, Saudi Arabia, Venezuela and Yemen). Only Thailand, Turkey and Karachi seem unaffected.
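The alerting step can be illustrated with a minimal sketch of a one-sided EWMA control chart. This is a textbook formulation and not necessarily the exact implementation in our system: the smoothing weight `lam`, the length of the baseline window `baseline_days`, and the function name `ewma_alerts` are assumptions made for illustration, and `z = 2.326` is the one-sided 99% quantile of the standard normal distribution.

```python
from statistics import mean, stdev


def ewma_alerts(counts, baseline_days=7, lam=0.2, z=2.326):
    """Return the indices of days whose EWMA exceeds the upper control limit.

    baseline_days, lam and z are illustrative assumptions, not values
    taken from the system described in the text.
    """
    # Estimate the in-control mean and standard deviation from the
    # first few days of the series (assumed to be outbreak-free).
    baseline = counts[:baseline_days]
    mu, sigma = mean(baseline), stdev(baseline)

    ewma = mu
    alerts = []
    for t, x in enumerate(counts):
        # Exponentially weighted update of the smoothed daily count.
        ewma = lam * x + (1 - lam) * ewma
        # Standard deviation of the EWMA statistic after t + 1 updates.
        width = sigma * (lam / (2 - lam) * (1 - (1 - lam) ** (2 * (t + 1)))) ** 0.5
        if ewma > mu + z * width:
            alerts.append(t)
    return alerts


# Hypothetical daily patient counts: a quiet baseline followed by a sharp rise.
daily_counts = [10, 12, 11, 9, 10, 11, 10, 10, 11, 30, 45, 60, 80]
alert_days = ewma_alerts(daily_counts)
```

In this sketch the returned indices correspond to the days that would be drawn as red circles on the time series plot.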
Figure 1: Line graphs of the gastrointestinal + hemorrhagic category plotted as patient counts per day by country.
Exploring the gastrointestinal and hemorrhagic categories, we drill down into the hospital admittance records in order to determine disease symptoms. Our toolkit automatically bins the free text fields of the patient admittance records, and a linked summary statistics window shows the five most numerous complaints per day for the selected category. By interactively scrolling through the alerts, we see that the most common hemorrhagic and gastrointestinal admittance record fields corresponding to alerts are: 'vomiting', 'abdominal pain', 'diarrhea', 'fever' and 'nose bleed'. Figure 2 shows our summary statistics window containing admittance record information and other automatically calculated statistical summaries of the data.
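The binning behind the summary statistics window can be sketched as a simple frequency count. The record layout (dictionaries with `date`, `category` and `complaint` keys) and the function name `top_complaints` are hypothetical and chosen only for illustration; the normalization shown (trimming whitespace and lowercasing) is likewise an assumption.

```python
from collections import Counter


def top_complaints(records, day, category, n=5):
    """Return the n most frequent complaint strings for one day and category."""
    counts = Counter(
        # Normalize the free text so trivially different spellings share a bin.
        r["complaint"].strip().lower()
        for r in records
        if r["date"] == day and r["category"] == category
    )
    return counts.most_common(n)


# Hypothetical admittance records, for illustration only.
sample_records = [
    {"date": "2009-05-10", "category": "gastrointestinal", "complaint": "Vomiting"},
    {"date": "2009-05-10", "category": "gastrointestinal", "complaint": "vomiting "},
    {"date": "2009-05-10", "category": "gastrointestinal", "complaint": "diarrhea"},
    {"date": "2009-05-10", "category": "respiratory", "complaint": "cough"},
    {"date": "2009-05-11", "category": "gastrointestinal", "complaint": "fever"},
]
top = top_complaints(sample_records, "2009-05-10", "gastrointestinal")
```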
Figure 2: Summary statistic view.
We further investigate the deaths by exploring the overall categorized death trends in our linked stacked graph view, Figure 3. Deaths, by category, totaled over all selected countries, are visualized on a single plot. As we scroll through time, a texture overlay is plotted to show links between the current day’s patients and the future times at which any subset of these patients died. From this, we can determine the approximate amount of time it takes a patient showing signs of gastrointestinal or hemorrhagic syndromes to die. We find that the average time from hospital admittance to reported death is ~8 days. Here, we also notice that a large number of patients categorized as ‘other’ are dying. We go back to the summary statistics window, explore the primary admittance text fields associated with the ‘other’ category, and find that patients in this category have symptoms including: 'nausea', 'cough', 'headache vomiting', 'back pain', and 'abd pain'. This indicates problems in handling misspelled or abbreviated fields during classification. We therefore further refine our disease symptoms to include signals misclassified as ‘other’.
Figure 3: A stacked graph view of total deaths from all selected countries.
Next, we analyzed the time between the onset and peak of the disease. In the line graph visualization, the user interactively measures the time from the first alert to the approximate peak of the outbreak by clicking anywhere within the line graph window and dragging the mouse. This creates a tape measure that reports the distance between two points along the temporal axis. Use of this tool is shown in Figure 4; with it, the user estimates that the time from the onset of the disease to its peak is approximately 10-12 days. We then extend the tape measure over the entire series of alerts to see the approximate mortality and attack rates for each country. For the mortality rate, we sum the number of patients who died after presenting gastrointestinal or hemorrhagic symptoms over this time period and divide by the total number of patients seen with gastrointestinal or hemorrhagic syndromes over the same period. This number is only an approximate mortality rate, as we ignore patients categorized as ‘other’ who may still be demonstrating signs of infection. The average mortality rate is ~10%. The attack rate is calculated as the number of patients with the selected syndromes divided by the total number of patients seen in that time span. The average attack rate is ~26%.
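The two ratios can be written out as a short sketch. The function name `outbreak_rates` and the counts passed to it are hypothetical; the counts were picked only so that the results match the reported averages of ~10% and ~26%, and they are not taken from the data set.

```python
def outbreak_rates(deaths_with_syndromes, patients_with_syndromes, total_patients):
    """Return (mortality_rate, attack_rate) over one measured time span.

    mortality_rate: deaths among patients with the selected syndromes,
                    divided by all patients seen with those syndromes.
    attack_rate:    patients with the selected syndromes, divided by all
                    patients seen in the same time span.
    """
    mortality_rate = deaths_with_syndromes / patients_with_syndromes
    attack_rate = patients_with_syndromes / total_patients
    return mortality_rate, attack_rate


# Hypothetical counts for one country over the span covered by the tape measure.
mortality, attack = outbreak_rates(
    deaths_with_syndromes=26,
    patients_with_syndromes=260,
    total_patients=1000,
)
```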
Figure 4: Using the ‘tape measure’ tool for calculating attack and mortality rates.
Finally, we analyze the death total by syndrome for each country to further strengthen our hypothesis. We use our line graph view for death rates, with EWMA applied. We see a roughly Gaussian-shaped trend in deaths for the eight previously mentioned countries, and we appear to pick up a similarly large death toll in Karachi. We hypothesize that the Karachi records cover too short a time span to adequately model the disease. This results in a false negative in the syndromic alerts, since the death curve seems to indicate the presence of an outbreak, as shown by the alert noted in Figure 5 (the first red dot on the left). Turkey and Thailand, however, show no signs of outbreak.
Figure 5: A line graph view of deaths per day by country.
To summarize, we find that this particular disease outbreak likely corresponds to the following hospital admittance text fields: 'vomiting', 'abdominal pain', 'diarrhea', 'fever' and 'nose bleed', with 'back pain' as another potential indicator. From initial detection of the disease, the outbreak appears to peak within 10-12 days. Mortality rates are ~10%, with patients succumbing to the illness over the course of ~8 days, and attack rates are ~26%. The pandemic cycle appears to have lasted approximately 22 days from the first alert. Only Turkey and Thailand appear unaffected in this data set, with the Karachi death totals implying the need for further investigation.
MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.
In comparing outbreaks across countries and cities, we again looked at alerts generated through the EWMA control charting method. Previously, we had determined that the syndromic categories of interest were hemorrhagic and gastrointestinal. To compare outbreaks across cities and countries, we use the map view visualization component of our system. As syndromic time series data is often noisy, we aggregate the data by week and explore it using a choropleth map view in which color maps to the percentage of patients with hemorrhagic or gastrointestinal syndromes out of the total number of patients seen. Figure 1 shows a screen shot of our map view visualization component. As we scroll through time, we can hypothesize that the disease progresses from Nairobi City in Kenya to Aleppo in Syria, Lebanon and Iran. From there, it seems to progress to other Middle Eastern countries (Saudi Arabia and Yemen) and then to South America, appearing in Colombia and Venezuela.
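The weekly aggregation that drives the map colors can be sketched as follows. The record layout (dictionaries with `location`, `date` and `category` keys), the use of ISO calendar weeks, and the function name `weekly_syndrome_share` are assumptions for illustration; the text above does not specify how weeks are delimited.

```python
from collections import defaultdict
from datetime import date


def weekly_syndrome_share(records, syndromes=("gastrointestinal", "hemorrhagic")):
    """Map (location, ISO year, ISO week) to the fraction of patients
    whose category is one of the selected syndromes."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        iso = r["date"].isocalendar()  # (ISO year, ISO week, ISO weekday)
        key = (r["location"], iso[0], iso[1])
        totals[key] += 1
        if r["category"] in syndromes:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical records, for illustration only.
sample = [
    {"location": "Nairobi", "date": date(2009, 5, 4), "category": "gastrointestinal"},
    {"location": "Nairobi", "date": date(2009, 5, 5), "category": "respiratory"},
    {"location": "Nairobi", "date": date(2009, 5, 7), "category": "hemorrhagic"},
    {"location": "Nairobi", "date": date(2009, 5, 8), "category": "other"},
    {"location": "Nairobi", "date": date(2009, 5, 11), "category": "other"},
]
shares = weekly_syndrome_share(sample)
```

Each resulting fraction would then be mapped to a color for the corresponding country or city on the choropleth.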
Figure 1: Choropleth map view of countries, colored by the percentage of patients showing gastrointestinal or hemorrhagic illnesses out of the total number of patients, aggregated by week. When only city data is available for a country, we color the country based on the syndromic percentages from the given city.
We then use the line graph views of syndromes by country to find the earliest alerts generated by EWMA, and thus the approximate timing of the outbreaks by day. An initial alert is generated in Nairobi City on 5-7-2009, Aleppo on 5-10-2009, Iran on 5-11-2009, Lebanon on 5-12-2009, Saudi Arabia on 5-11-2009, Yemen on 5-10-2009, Colombia on 5-11-2009, and Venezuela on 5-11-2009. Next, using the line graph views of deaths by country, we again see that the trend of deaths in Karachi matches the trend of deaths found in cities where statistically significant alerts are generated. From this set of data, we hypothesize that the origin of the outbreak is likely to have been Karachi. We see the first major increase in deaths in Karachi on 5-6-2009. Given the ~8 day lag found earlier between hospital admittance and death, we can hypothesize that an outbreak may have begun in Karachi as early as 4-28-2009.
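The back-dating of the Karachi onset is plain date arithmetic, shown here as a small check. The variable names are illustrative; the two inputs are the date of the first major increase in deaths and the ~8 day lag reported above.

```python
from datetime import date, timedelta

# First major increase in deaths observed in Karachi.
first_death_rise = date(2009, 5, 6)

# Approximate lag before death found in the stacked graph view.
lag_to_death = timedelta(days=8)

# Earliest date at which the outbreak may have begun.
estimated_onset = first_death_rise - lag_to_death
print(estimated_onset.isoformat())  # 2009-04-28
```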
To further analyze the number of people infected and the recovery ability of each city, we explored the data as a function of the underlying population. Analyzing the data in this manner revealed other anomalies. While the underlying city and country populations are ingested into our system, the following analyses were done by hand simply as a means to quickly summarize the data set with respect to the city/country population (the attack and mortality rates, however, were found using our system).
The city of Karachi (population of 11.6 million), while showing no syndromic alerts, experienced a large population loss during this study. There were 165,606 total death records, indicating that 1.4% of the city population died in a 14 day period. Based on the sudden rise in deaths, it is likely that this city had difficulty coping with the pandemic and stopped reporting data.
Nairobi City has a population of about 3.5 million people. From the death records, we see that 43,719 people died during the testing phase, which is roughly 1.25% of the city. The infection rate was 31%. Thus Nairobi, as one of the initial points of the pandemic, is heavily impacted.
The population of Aleppo is approximately 1.6 million people, and we received approximately 1 million hospital records over a 32 day period, indicating that ~63% of the population of Aleppo recorded a hospital visit during this time period. Further, when analyzing the number of deaths over this time period, we find that 4.9% of the population died. As in Karachi, the sudden rise in deaths likely crippled city infrastructure, thus stopping the reporting of data.
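The by-hand population percentages quoted for Karachi, Nairobi City and Aleppo can be reproduced from the figures given above. This is only a check of the arithmetic; the helper name `percent_of_population` is illustrative, and the populations and record counts are the approximate values stated in the text.

```python
def percent_of_population(count, population):
    """Express a record count as a percentage of the underlying population."""
    return 100.0 * count / population


# Deaths as a share of the city population.
karachi_deaths = percent_of_population(165_606, 11_600_000)   # about 1.4%
nairobi_deaths = percent_of_population(43_719, 3_500_000)     # about 1.25%

# Hospital records as a share of the city population.
aleppo_visits = percent_of_population(1_000_000, 1_600_000)   # 62.5%, i.e. ~63%
```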
Most of the remaining countries displayed a hospital admittance rate of around 1.0%. However, Saudi Arabia had a rate of 3.7%, while Lebanon's rate was 10.5%. The infection rate from gastrointestinal and hemorrhagic syndromes was 22% for Saudi Arabia and 24% for Lebanon.